Goto

Collaborating Authors

 recovery rate



Graph Denoising Diffusion for Inverse Protein Folding

Neural Information Processing Systems

Inverse protein folding is challenging due to its inherent one-to-many mapping characteristic, where numerous possible amino acid sequences can fold into a single, identical protein backbone. This task involves not only identifying viable sequences but also representing the sheer diversity of potential solutions. However, existing discriminative models, such as transformer-based auto-regressive models, struggle to encapsulate the diverse range of plausible solutions. In contrast, diffusion probabilistic models, as an emerging genre of generative approaches, offer the potential to generate a diverse set of sequence candidates for determined protein backbones. We propose a novel graph denoising diffusion model for inverse protein folding, where a given protein backbone guides the diffusion process on the corresponding amino acid residue types. The model infers the joint distribution of amino acids conditioned on the nodes' physiochemical properties and local environment. Moreover, we utilize amino acid replacement matrices for the diffusion forward process, encoding the biologically meaningful prior knowledge of amino acids from their spatial and sequential neighbors as well as themselves, which reduces the sampling space of the generative process. Our model achieves state-of-the-art performance over a set of popular baseline methods in sequence recovery and exhibits great potential in generating diverse protein sequences for a determined protein backbone structure.





Graph Denoising Diffusion for Inverse Protein Folding

Neural Information Processing Systems

Moreover, we utilize amino acid replacement matrices for the diffusion forward process, encoding the biologically meaningful prior knowledge of amino acids from their spatial and sequential neighbors as well as themselves, which reduces the sampling space of the generative process.


Optimal Sampling and Clustering in the Stochastic Block Model

Neural Information Processing Systems

This paper investigates the design of joint adaptive sampling and clustering algorithms in networks whose structure follows the celebrated Stochastic Block Model (SBM). To extract hidden clusters, the interaction between edges (pairs of nodes) may be sampled sequentially, in an adaptive manner.


China figured out how to sell EVs. Now it has to bury their batteries.

MIT Technology Review

China figured out how to sell EVs. Now it has to bury their batteries. As early electric cars age out, hundreds of thousands of used batteries are flooding the market, fueling a gray recycling economy even as Beijing and big manufacturers scramble to build a more orderly system. In August 2025, Wang Lei decided it was finally time to say goodbye to his electric vehicle. Wang, who is 39, had bought the car in 2016, when EVs still felt experimental in Beijing. It was a compact Chinese brand.


RadDiff: Retrieval-Augmented Denoising Diffusion for Protein Inverse Folding

arXiv.org Artificial Intelligence

Protein inverse folding, the design of an amino acid sequence based on a target 3D structure, is a fundamental problem of computational protein engineering. Existing methods either generate sequences without leveraging external knowledge or relying on protein language models (PLMs). The former omits the evolutionary information stored in protein databases, while the latter is parameter-inefficient and inflexible to adapt to ever-growing protein data. To overcome the above drawbacks, in this paper we propose a novel method, called r etrieval-a ugmented d enoising diff usion (RadDiff), for protein inverse folding. Given the target protein backbone, RadDiff uses a hierarchical search strategy to efficiently retrieve structurally similar proteins from large protein databases. The retrieved structures are then aligned residue-by-residue to the target to construct a position-specific amino acid profile, which serves as an evolutionary-informed prior that conditions the denoising process. A lightweight integration module is further designed to incorporate this prior effectively. Experimental results on the CA TH, PDB, and TS50 datasets show that RadDiff consistently outperforms existing methods, improving sequence recovery rate by up to 19%. Experimental results also demonstrate that RadDiff generates highly foldable sequences and scales effectively with database size. Proteins are the molecular machines of life, executing a vast array of biological functions dictated by their three-dimensional (3D) structures (Koehler Leman et al., 2023).